Rhizobia–diatom symbiosis fixes missing nitrogen in the ocean

Nitrogen (N2) fixation in oligotrophic surface waters is the main source of new nitrogen to the ocean1 and has a key role in fuelling the biological carbon pump2. Oceanic N2 fixation has been attributed almost exclusively to cyanobacteria, even though genes encoding nitrogenase, the enzyme that fixes N2 into ammonia, are widespread among marine bacteria and archaea3–5. Little is known about these non-cyanobacterial N2 fixers, and direct proof that they can fix nitrogen in the ocean has so far been lacking. Here we report the discovery of a non-cyanobacterial N2-fixing symbiont, ‘Candidatus Tectiglobus diatomicola’, which provides its diatom host with fixed nitrogen in return for photosynthetic carbon. The N2-fixing symbiont belongs to the order Rhizobiales and its association with a unicellular diatom expands the known hosts for this order beyond the well-known N2-fixing rhizobia–legume symbioses on land6. Our results show that the rhizobia–diatom symbioses can contribute as much fixed nitrogen as can cyanobacterial N2 fixers in the tropical North Atlantic, and that they might be responsible for N2 fixation in the vast regions of the ocean in which cyanobacteria are too rare to account for the measured rates.


Initial Ca. T. diatomicola genome reconstruction
Ca. T. diatomicola nifH genes were first identified in a read-level analysis of the PacBio long-reads using the "sqm_longreads.pl"script, which is part of the SqueezeMeta metagenomics pipeline 53 v1.6.2.This script performed DIAMOND 50 v2.0.8 blastx searches against the NCBI-nr 55 (release 247), KEGG 51 v58 and eggNOG 52 v4.5 databases.The predicted protein sequences were then searched for the presence of the full-length Gamma-A nifH 97 using blastp 41 , resulting in a number of matching proteins (> 99% sequence identity over the full length).One sample that contained multiple Gamma-A nifH encoding reads was then chosen for an in-depth analysis (MSM89-S1, 3-10 µm size fraction).The PacBio long-reads of this sample were assembled into contigs using the long-read metagenome assembler hifiasm-meta 69 v0.2-r043.To reduce the loss of sequence information, unassembled reads were added to the contigs.In detail, long reads were mapped to contigs using minimap2 (ref. 42) v2.22-r1101 and mapped reads were removed from the mapping file using SAMtools 43 ('samtools view -f').The fasta sequences of the remaining unmapped reads were subsequently extracted using the script reformat.shv38.70 (BBMap -Bushnell B.sourceforge.net/projects/bbmap/) and added to the assembled contigs.The combined contigs and reads were then analyzed using the SqueezeMeta metagenomics pipeline 53 v1.6.2 including taxonomic and functional annotation of all reads and contigs as well as binning into metagenome-assembled genomes (MAGs).
Metagenomic short reads from the two stations in which the Hyphomicrobiaceae MAG had highest abundance (MSM161-S4 and S7) were then co-assembled using MEGAHIT 39 v1.2.9.The assembled contigs were analyzed using the SqueezeMeta metagenomics pipeline 53 v1.6.2 as described above.The resulting MAGs were then searched for the presence of a MAG that was similar to the previously identified low-quality Hyphomicrobiaceae MAG using fastANI 70 v1.33,revealing a better-quality MAG (total size of ~1.1 Mb, ~50% completeness, 0.5% redundancy), which was also assigned to the Hyphomicrobiaceae.This MAG was then iteratively extended and refined using a custom script described in the main Methods section (see code availability statement), resulting in an improved MAG of ~1.25 Mb, 76% completeness and 0% redundancy based on CheckM2 (ref. 46) v0.1.2estimates, which is particularly accurate for organisms with reduced genomes.The contigs of this improved MAG were used as queries during the reconstruction of the final Hyphomicrobiaceae MAG, ultimately called Ca. T. diatomicola.

Recovery of Haslea spp. in metagenomic contigs
To investigate the presence of Haslea spp.diatoms in our metagenomes, the large assembly (contigs > 1 Kb) generated for the recovery of the final Ca. T. diatomicola genome was annotated using the SqueezeMeta metagenomics pipeline 53 v1.6.2 as described above.
Using the contig consensus annotation, 17 contigs assigned to the genus Haslea were identified (Supplementary Table 2).At the time of our analysis, the NCBI protein database contained < 3,000 sequences from only a few Haslea species, and almost all of them represented chloroplast or mitochondrial sequences.It is thus likely that our assembly contained more Haslea contigs, which escaped detection.

Abundance of cyanobacterial and heterotrophic N2 fixers in the tropical North Atlantic
The relative abundance of eight cyanobacterial and forty heterotrophic N2 fixers 5,101 as well as the two Ca.Tectiglobus genomes was assessed in the metagenomes we generated from the tropical North Atlantic.Relative abundance of genomes in the metagenomes was calculated based on mapped reads using CoverM v0.6.1 (https://github.com/wwood/CoverM),requiring ≥ 95% sequence identity and ≥ 80% of the read to align.

Phylogenetic analyses
A maximum likelihood phylogenetic tree of Ca. T. diatomicola and Ca.T. profundi within Rhizobiales was calculated based on sixteen ribosomal proteins 57 as follows: 2,376 Rhizobiales and 55 Parvibaculales (outgroup) genomes were retrieved from the genome taxonomy database (GTDB) 16 version r214.To this collection, the genomes of Ca. T. diatomicola and Ca.
To calculate the maximum likelihood tree of NifH proteins, NifH amino acid sequences from the 85,205 GTDB species representatives 103 (version r214) were identified using an alignment score ratio approach as previously described 104 .In brief, the protein complement of the 85,205 genomes was used as query in a DIAMOND 50 v2.0.8 search against a self-curated database of NifH sequences (curated from Méheust et al. 105 by removing long sequences).
Subsequently, the maximum alignment score of all sequences that had a hit was calculated.To identify the NifH sequences, the ratio of alignment score against the database and maximum alignment score was used.NifH sequences were then aligned using muscle 58 v3.8.1551 and a preliminary phylogenetic tree was calculated using FastTree 59 v2.1.11with default parameters.This preliminary phylogenetic tree was used to discern group I NifH proteins from those that belong to other groups 105 .For the final phylogenetic analysis, group I NifH sequences obtained from the Pseudomonadota, the Ca.T. diatomicola NifH from the Marine Atlas of Tara Ocean Unigenes (MATOU-v1_23614344) 97 , the NifH from the Ca.T. diatomicola and Ca.T. profundi genomes as well as the cyanobacterial NifH sequences from the Méheust et al. 105 reference dataset (outgroup) were used.The selected sequences were aligned using MAFFT 72 v7.505 ("-auto").The alignment was trimmed using trimAl 73 v1.4.1.A maximum likelihood phylogenetic tree was calculated using IQ-TREE 76 v2.2.0.3 with automated model selection using ModelFinder 74 including 1000 ultrafast bootstraps using UFBoot2 (ref. 75).The resulting tree was visualized in iTOL 60 v6.8.1.For NifDKENBS, phylogenetic trees were created the same way as for NifH, using proteins from the same Pseudomonadota genomes included in the NifH phylogeny.The NifDKENBS proteins were identified in these genomes based on homology to the respective Ca.T. diatomicola proteins using the alignment score ratio approach described above.Because NifEN are encoded as one large fusion protein in some genomes (both Ca. Tectiglobus genomes, the closest nifH relative and some cyanobacterial genomes), they were manually separated into the NifE and NifN domains based on domain annotation using the InterPro web server 54 v98.0 (https://www.ebi.ac.uk/interpro/result/InterProScan).These separated NifE and NifN domains were then used for creating the phylogenetic trees.
To calculate the maximum likelihood tree of CcoN proteins, CcoN amino acid sequences were identified and retrieved from the protein complement of the 85,205 GTDB species representatives 103 (version r214), using a curated reference database of CcoN sequences from Murali et al. 106 .The identified CcoN sequences were clustered at 70% sequence identity using USEARCH 71 v11.0.667 and the Ca.T. diatomicola and Ca.T. profundi CcoN sequences were added to the dataset.The clustered sequence set was aligned using muscle 58 v3.8.1551 and a preliminary phylogenetic tree was calculated using Fasttree 59 v2.1.11followed by selection of sequences representing the broad phylogenetic neighborhood of the Ca.T. diatomicola CcoN.
A phylogenetic tree of these selected sequences was calculated the same way as the preliminary phylogenetic tree.The 92 sequences representing the close phylogenetic neighborhood of the Ca.T. diatomicola CcoN sequence were then selected for the final CcoN phylogenetic analysis.The final CcoN phylogenetic tree was calculated with IQ-TREE 76 v2.2.2.7, using the LG+F+I+G4 model after evaluation with ModelFinder 74 including 1000 ultrafast bootstraps using UFBoot2 75 .The resulting tree was visualized in iTOL 60 v6.8.1.

Bulk rates of CO2 and N2 fixation
Rates of CO2 and N2 fixation were determined following the approach of Großkopf et al. 85 and Martínez-Pérez et al. 10 .Stable isotope incubations were carried out in triplicates in 4.7-L transparent polycarbonate bottles (Nalgene; cleaned with hydrochloric acid prior to the cruise).
These were amended with 13 C-bicarbonate (NaH 13 CO3; ≥ 98 at% 13 C; Sigma-Aldrich) and 15 N-N2 (Cambridge Isotope Laboratories, obtained via Eurisotop; lot numbers: I-21065/AR0664758 (MSM89), I-19197/AR0586172 and I-21065/AR0664758 (M161)) to reach enrichment levels of 5.7-6.8(Mean= 6.1, SD= 0.3; both cruises combined) at% for 13 C in the dissolved inorganic carbon (DIC) pool and 9.1-14.0(Mean= 11.5, SD= 1.6; MSM89 cruise) and 4.5-8.1 (Mean= 6.4,SD= 1.1; M161 cruise) at% for 15 N in the N2 pool.The 15 N-N2 gas was checked for 15 N-NH4 + contamination using the hypobromite method 107 prior to the cruises, and no contamination was detected.Heavy stable isotope enrichment of each substrate pool was determined from subsamples taken at the end of the incubations from every isotope-amended bottle using membrane inlet mass spectrometry (for 15 N in the N2 pool) and cavity ring-down spectroscopy (for 13 C in the DIC pool).The 15 N2 gas was added using a modified bubble method 108,109 for which the bottles were gently agitated for 15-20 min prior to bubble removal.Bottles were incubated headspace-free in on-deck incubators, which were continuously flushed with surface seawater under simulated ambient light conditions 86 (36% light transmission, Lee filter, no.724, 'Ocean Blue' for surface waters).After ~ 24 h, 3.5 L of incubated seawater were filtered onto pre-combusted (4-6 h at 450 °C) GF/F filters (Whatman; diameter 25 mm) for bulk elemental and isotopic analyses, and 100-120 mL were preserved for FISH and nanoSIMS analyses (see below).GF/F filters were dried (55-65 °C overnight) on board and/or frozen at -20 °C until further processing.The elemental and isotopic composition of the collected biomass was determined using an elemental analyzer (Thermo Flash EA, 1112 Series) coupled to a continuous-flow isotope ratio mass spectrometer (Delta Plus XP IRMS; Thermo Fisher Scientific) after GF/F filters were decalcified overnight in an acidic atmosphere (using fuming hydrochloric acid).CO2 and N2 fixation rates were subsequently calculated based on the incorporation of 13 C and 15 N into biomass (i.e. the change in isotopic composition) according to Großkopf et al. 85 and Martínez-Pérez et al. 10 .Natural abundances for the calculations were obtained from unamended single control bottles incubated alongside the isotope-amended bottles.These samples were processed the same way as described above.All rates are reported as averages of triplicates.Samples for FISH (from the start of incubations and depth profiles) and nanoSIMS were fixed using methanol-free paraformaldehyde solution (1% w/v final concentration) either for ~ 24 h at 4 °C or for a few hours at 4 °C followed by 0.5 h at room temperature.Fixed FISH and nanoSIMS samples were subsequently filtered onto polycarbonate and gold (Au)-coated polycarbonate filters (Isopore, 0.2 µm pore size, 25 mm diameter), respectively.All filters were subsequently rinsed with ultrapure water (MilliQ) before being dried and stored at -20 °C until further analyses.

Design and optimization of FISH-probes targeting T. diatomicola
For fluorescence in situ hybridization (FISH) probe design, the Ca.T. diatomicola 16S rRNA sequence obtained in this study was imported into ARB 110 (version arb-devel-7.1.rev19270)and aligned to the SILVA_138.1_SSURef_NR99_12_06_20database 111 using the SINA aligner integrated into ARB 110 .FISH probes were designed using the probe design tool in ARB and evaluated in silico using MathFISH 112 .Unlabeled competitor probes were designed for non-target organisms, if the difference between the in silico-predicted formamide melting concentrations of target and non-target organisms was less than 20% (ref. 112).In addition, for probes that bind to regions of the 16S rRNA with low accessibility 113 , unlabeled helper probes (with equal or higher in silico predicted formamide melting concentration) were designed to enhance probe binding to the target organisms.The newly designed probes were tested and optimized using Clone-FISH 89 .For this, paraformaldehyde-fixed E. coli cells (JM109(DE3)) containing the full-length 16S rRNA gene sequence of Ca. T. diatomicola on a peT-23a(+) plasmid (obtained from GenScript) were prepared as previously described 89 .Formamide melting curves 114 were performed using the newly designed probes in the form of double-Atto488-labelled probes (DOPE-FISH 115 ), together with all corresponding unlabeled competitor and/or helper probes 116 .Fixed E. coli cells were applied to Teflon-coated glass microscope slides (Marienfeld, Germany), dried and dehydrated using increasing concentrations of ethanol.FISH was performed using 0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60 and 70% formamide in the hybridization buffer, as previously described 116 .For each formamide concentration, at least four images were acquired on an Axio Imager 2 microscope (Zeiss) of random fields of view using identical exposure settings for all formamide concentrations.Formamide melting curves were analyzed using daime 117 v2.2.2 (see Extended Data Table 2 for optimal formamide concentrations).

Scanning electron microscopy (SEM)
To identify and characterize the host of Ca. T. diatomicola, diatoms containing Ca. T. diatomicola cells were visualized by scanning electron microscopy (SEM) with a FEI Quanta 250 FEG ESEM (Thermo Fisher Scientific, FEI).Secondary electron micrographs were taken using a primary electron beam with an acceleration voltage of 2 kV and an Everhart-Thornley detector for the detection of the secondary electrons on Au-coated filters.For non-coated filters, the low vacuum mode of the SEM was used for imaging with a pressure of around 9e -1 mbar and an acceleration voltage of 5 kV for the primary electrons.The secondary electrons micrographs were taken with a Large Field Gaseous (LFD) detector.

Single-cell activities using nanoSIMS
Single-cell C and N isotopic composition of Ca. T. diatomicola and its Haslea hosts as well as Richelia and its diatom hosts were determined using nanoscale secondary ion mass spectrometry (nanoSIMS; nanoSIMS50L, CAMECA).Filter areas containing marked diatoms were pre-sputtered with a Cs + ion beam (~300 pA) to remove surface contamination and the outer silica layer of the diatom.Samples were then measured with a Cs + ion beam intensity of 1-1.5 pA and a beam diameter of ~100 nm.Each analysis was carried out with a dwell time of 1 ms per pixel and raster sizes of 20 µm x 20 µm to 40 µm x 40 µm, each with 256 pixel x 256 pixel.The secondary ions 12 C -, 13 C -, 12 C 14 N -, 12 C 15 N -, 31 P -and 32 S -were recorded simultaneously in up to 60 planes for each measurement.Look@nanoSIMS 118 v2018 was used to process the data, including dead-time and drift correction, and accumulating the recorded planes.The isotopic ratios ( 13 C/ 12 C and 12 C 15 N/ 12 C 14 N) of regions of interest were determined by overlaying epifluorescence images of the FISH-positive cells and their host diatom (Ca.T. diatomicola) or autofluorescence images (diatom-associated Richelia) with the nanoSIMS secondary electron images.Only cells with a Poisson error of less than 5% were considered for further analyses.Ratios of 15 N/ 14 N (from 12 C 15 N/ 12 C 14 N) and 13 C/ 12 C were then used to compute the cell specific N and C assimilation rate of individual cells following the equation below: Cellular rate (C or N cell -1 d -1 ) = (at%cell -at%control) / (at%CO2 or N2 -at%control) * Ccell or Ncell * 1/t Where at%cell, at%control, at%CO2 or N2 are the atom% 13 C or 15 N in the individual cell, the natural abundance/control atom% 13 C or 15 N from the bulk incubation, the atom% 13 C or 15 N in the DIC or N2 substrate pools, respectively; Ccell or Ncell are the C or N biomass per cell, respectively, and t is the incubation time in days.The cellular C biomass was determined from biovolumes and a biovolume-to-biomass conversion.Biovolumes were computed from the size measurements obtained during SEM analysis (for diatom hosts of Ca. T. diatomicola; height was assumed as '0.6 x width' 119 ), nanoSIMS analysis (for Ca.T. diatomicola and Richelia) or brightfield/epifluorescence microscopy (Richelia hosts).The prolate spheroid model and the prism-on-elliptic base model were used for Ca.T. diatomicola and its host, respectively 120 .
Richelia and their hosts were processed using the prolate spheroid and cylinder+2 cones models 120 .Biovolumes were subsequently converted to C biomass using equations provided by Verity et al. 121 for diatom hosts and Richelia and Khachikyan et al. 122 for Ca.T. diatomicola, which takes into account the higher C density of smaller cells.Cellular C biomass was converted to cellular N biomass using the Redfield ratio of 6.6:1 for C:N.
In order to determine how much of the fixed C and N remained inside the hosts and symbionts within each Ca.T. diatomicola-Haslea symbiosis, the total fixation rates were mass-

Detailed description of diatoms hosting Ca. T. diatomicola
Fluorescence and SEM images were used to identify and characterize the diatoms containing Ca. T. diatomicola cells.The frustules available for examination were not well preserved with most frustules collapsed into a more or less two-dimensional structure (so that occasionally both external and internal features were visible in the same frustule) (Extended Data Fig. 4).However, they were clearly identified as naviculoid pennate diatoms likely belonging to the genus The observed cells varied in length between 19.8 and 57.8 µm (Mean= 32.2, Standard deviation = 12.7, n= 9), with a central transapical axis of between 3.2 and 8.1 µm (Mean= 4.9, Standard deviation = 1.4).They contained two lobed chloroplasts on either side of the central nucleus with the lobes running in parallel to the apical axis (Extended Data Fig. 4f).All cells were lanceolate in shape with acute apices.As described by Round et al. 123 , in some specimens the valve apices appeared slightly twisted along the apical axis.Rows of uniseriate transapical striae, were visible, with about 30 transapical stria in 10 µm.This is within a range also described for other Haslea species 25 .These structures were overlaid externally by longitudinal strips.
The raphe fissure was bordered internally by ridges.The ridges extended along most of the raphe but terminated just before the valve pole (Extended Data Fig. 4h).The ribs along the raphe were slightly asymmetric in the central valve area (arrow and arrowhead in Extended Data Fig. 4g and j, respectively).These structures were clearly discernable.However, the raphe slit itself and the raphe endings were mostly not seen in the samples (the latter possibly obscured or modified by the symbiont, the former is usually hidden by the bordering ridges).
Round et al. 123 have described thickened central costae in epipelic members of Haslea but mentioned that in planktonic forms these seem to be absent.As the specimen examined in the present study were all planktonic, the absence of such thickened areas would not contradict their assignment to the genus Haslea.Haslea species without this characteristic have also been described by other authors (e.g.ref. 25 ).

Identification of Ca. T. profundi nif genes
The closest relative of Ca. T. diatomicola, Ca.T. profundi, was identified from a metagenome recovered from sediment trap particles collected at station ALOHA in the North Pacific (GCA-013214245) (ref. 99).Similar to Ca. T. diatomicola, the genome of Ca. T. profundi was reduced in size and GC content (Extended Data Fig. 3), suggesting a similar obligate symbiotic lifestyle.Unlike in Ca.T.diatomicola though, initially no nif genes were identified in the genome of Ca. T. profundi.Thus, in an effort to recruit nif genes potentially missed in the original study, the Ca.T. profundi genome (GCA-013214245) was iteratively extended and refined.This led to the recruitment of nif genes (Supplementary Table 5) including a nifH that is 95% identical to the one of Ca. T. diatomicola (Extended Data Fig. 2).An identical nifH was also found in a MAG assigned to Micavibrionaceae (GCA-013215815), derived from the same sediment trap 99 .Manual inspection of the Micavibrionaceae MAG using anvi'o 40 v7.1 revealed that the nifH-encoding contig was incorrectly assigned to this MAG.The refined Ca.T. profundi balanced as follows: B(total) = B(host) + A * B(symb.) with B equals biomass as either carbon (C) or nitrogen (N), B(total) in fmol C or N symbiosis -1 d -1 and B(host) or B(symb) in fmol C or N cell -1 d -1 , and A as the average number of symbionts per host.Solving for either host or symbiont, the amount of fixed C and N can be expressed as % recovered in either host or symbiont.Cellular rates and abundances were combined to determine the absolute contributions of the Ca.T. diatomicola-Haslea symbiosis and diatom-associated Richelia to the bulk N2 fixation rate as follows: Nfixgroup = Nfixcell * Abundancecell where Nfixgroup, Nfixcell, and Abundancecell are the group-specific contribution, the cellular rate (either Ca.T. diatomicola-Haslea symbiosis or Richelia symbioses), and the abundance (of either Ca.T. diatomicola-Haslea symbiosis or Richelia symbioses), respectively.For the Ca.T. diatomicola-Haslea symbiosis and the diatom-associated Richelia, the amount of N recovered in the diatom hosts was also taken into account.